Abstract

Compared to machines, humans are extremely good at classifying images into categories, especially when they possess prior knowledge of the categories at hand. If this prior information is not available, supervision in the form of teaching images is required. To learn categories more quickly, people should see important and representative images first, followed by less important images later - or not at all. However, image-importance is individual-specific, i.e. a teaching image is important to a student if it changes their overall ability to discriminate between classes. Further, students keep learning, so while image-importance depends on their current knowledge, it also varies with time.

In this work we propose an Interactive Machine Teaching algorithm that enables a computer to teach challenging visual concepts to a human. Our adaptive algorithm chooses, online, which labeled images from a teaching set should be shown to the student as they learn. We show that a teaching strategy that probabilistically models the student's ability and progress, based on their correct and incorrect answers, produces better ‘experts’. We present results using real human participants across several varied and challenging real-world datasets.

1. Introduction 

Large, manually annotated image datasets have contributed to recent performance increases in core computer vision problems such as object detection and classification. In cases where the visual categories of interest are generic everyday objects, annotation can be completed by crowdsourcing labels from the internet using services such as Mechanical Turk. A typical image labeling task begins with a set of instructions to the annotator, showing them example images from the classes of interest. The annotator is then asked to assign class labels to new images where the ground truth is unknown.

But what happens if the annotator is unsure? This is a real problem when annotators are incorrectly assumed to have prior knowledge of the classes of interest from which they can generalize, either from everyday life or from specialized training. For many problems, highly specialized, domain specific knowledge, acquired through extensive training, is needed before someone can differentiate between potentially multiple, highly self-similar object categories.

Figure 1: In Interactive Machine Teaching the computer teaches the human learner, one image at a time. It begins by showing them an image from the larger labeled image dataset while concealing the true class label. The learner/student responds with their estimate of the image’s class. The teacher then updates their model of the student, and finally reveals the correct answer to them. This process is repeated with further images until teaching ends.

Designing the set of teaching images or ‘teaching set’ to show the annotators is challenging, because each annotator will have a different degree of expertise. While it is possible to model the uncertainty and noise generated from groups of annotators to improve their collective performance [40, 27], these approaches tend to downweight votes from weak annotators by learning to trust the experts. In this paper, we pose the question - how does one become an expert? We posit that a human’s discriminative ability for a given visual classification task can be improved by better modeling the teaching process required to make them experts.

The family of methods referred to as Machine Teaching offers a general solution to the problem of teaching humans [43, 18, 42, 39, 33]. Machine Teaching is not the same as Active Learning [38]. In Active Learning, the computer’s goal is to learn more accurate models given the smallest amount of supervision. This is achieved by carefully selecting only the most informative datapoints to be labeled by the human. In Machine Teaching, the computer, rather than the human, is the (perfect) oracle and is tasked with delivering a teaching set to the human student to help them learn the given task more effectively. Teaching the human a new skill is useful in its own right. Further, they are now better positioned to accurately annotate the additional unlabeled data outside the original teaching set. Automatic teaching algorithms have applications in many domains, from education and language learning to medical image analysis and biological species identification [30]. Crucially, for automated teaching to be effective, it needs to be able to assess the student’s current knowledge, and have a mechanism for selecting teaching examples to best improve this knowledge.

In this work, we focus on the task of image classification. Here, it is not possible for the teacher to directly ‘teach’ the high-dimensional decision boundary to the human learner, so instead, the student must learn this boundary by being shown teaching images. Our goal during teaching is to choose teaching images that will maximize the student’s classification ability in the minimum amount of teaching time. Unlike computers, humans have both limited and imperfect memory for instance-level recognition, especially during the initial learning of a task. However, humans have the advantage of possessing the ability to generalize to unknown examples and perform domain adaptation given only few instances. The majority of previous work in Machine Teaching has focused on non-interactive teaching, where one teaching set is computed offline, independent of feedback from each student [39, 33]. In this work, we address the under-explored problem of interactive teaching. Here, the teacher can adapt their teaching set online, based on the current performance of the individual student (see Figure 1).

We propose an algorithm that interactively teaches multiple visual categories to human learners. Our contributions are threefold: 1) Unlike computers, humans are not optimal learners. Our algorithm models student ability online, resulting in teaching sets that are adapted to each individual student. We make no assumptions regarding the internal learning model used by the students. Instead, we present them with teaching images that attempt to reduce their predicted future uncertainty based on an estimate of their current knowledge. 2) Our teaching algorithm reduces the amount of time it takes students to learn categorization tasks involving multiple classes. Experimentally we show that real human participants, using our algorithm, perform better than other baselines on several challenging datasets. 3) Finally, we provide a web based interface and framework for exploring new teaching strategies. Our intention is that this will encourage the development of new and diverse teaching strategies for a variety of human visual learning tasks.


2. Related Work 

Here we cover the most closely related work in Machine Teaching. As we are concerned with the task of image categorization, we focus on research concerning teaching classification functions. However, it is worth noting that different types of teaching tasks have been explored in the literature, e.g. sequential decision tasks. How humans acquire and represent categories is an active area of research in visual psychology. Many candidate models for category acquisition and representation in humans exist, and for an overview we direct the readers to [28, 35]. In this work, our goal is not to model these internal processes directly, but to instead treat the human as a stochastic black box learner. For convenience, we divide the related work in Machine Teaching into two areas - batch (fixed) teaching and interactive (adaptive or online) teaching. For a recent, and general, introduction to Machine Teaching, please see [43].

Machine Teaching - Batch (Fixed) 

In batch-based teaching, the teacher’s goal is to construct an optimal set of teaching examples offline, which are then presented to the student during teaching. Early work in this area focused on the theoretical analysis of the teaching dimension. The teaching dimension is defined as the minimum number of examples required from a given concept to teach the concept to a student. Like many other works in teaching, this analysis makes the simplifying assumption that the student has perfect memory (i.e. once shown an example the student will remember it in the future) - an assumption that is violated in real world teaching. Other theoretically motivated works, while interesting, provide little validation on real human subjects [13, 46].

More recently, Zhu attempted to minimize the joint effort of the teacher and the loss of the student by optimizing directly over the teaching set. The proposed noise-tolerant model assumes that the student’s learning model is known to the teacher, and that it is in the exponential family. In follow-on work, Patil et al. maintain that unlike computers, which have infinite memory capabilities, humans are limited in their retrieval capacity. Motivated by real human studies, they show that modeling this limited capacity improves human learning performance on tasks involving simple one-dimensional stimuli.

Most related to our work, Singla et al. [39] teach binary visual concepts by showing images to real human learners. Their method operates offline and tries to find the set of teaching examples that best conveys a known linear classification boundary. Experiments with Mechanical Turkers show an improvement compared to other baselines, including random sampling. Their approach attempts to encode some noise tolerance into the teaching set, but is still unable to adapt to a student’s responses online during teaching, because the ordering of the teaching images is fixed offline.


Machine Teaching - Interactive (Adaptive) 

Real human students are often noisy, especially in the early stages of learning when the concepts to be learned are not formed in their minds. Additionally, students do not all learn at the same rate - concepts that are difficult for some students may be easier for others. In Interactive Teaching (Figure 1), the teacher receives feedback from the student as teaching progresses. Given this feedback, the teaching strategy can adapt to the current ability of an individual student over time.

Using a probabilistic model of the student and a noise-free learning assumption, Du and Ling propose a teaching strategy called ‘worst predicted’. This strategy is similar to uncertainty sampling, which is commonly found in Active Learning [38]. However, unlike Active Learning, in Machine Teaching the teacher has access to the ground truth class labels and can use this to assess the student’s performance during teaching. Experimentally, we show that their strategy performs sub-optimally as it only seeks to show the student the image that they are currently most uncertain about, without regard for how informative that image may be in relation to others. As a result, it is very susceptible to teaching outliers, i.e. unrepresentative images at the fringes of the teaching set.

In one of the few interactive teaching papers that deal with visual concepts, Basu and Christensen evaluate human learning performance in binary classification using three different teaching methods. Students were tasked with classifying simple synthetically generated (and linearly separable) depictions of mushrooms into one of two categories. They do not explicitly model labeling noise from the student, but instead investigate different interface designs and feature space exploration methods to help teach the students.

In this paper, we address the problem of interactive multi-class teaching with real images by directly modeling the student’s ability as they provide feedback during teaching.

3. Machine Teaching 

In this section we formally define our Machine Teaching task. Our teacher-computer has access to a labeled dataset $\mathcal{D} = \{(x_1, y_1), \ldots, (x_N, y_N)\}$, where each $x_i$ is an $M$ dimensional feature vector encoding an image $I_i$, and $y_i \in \{1, \ldots, C\}$ is its corresponding class label. The teacher’s goal is to ‘teach’ the classification task to the human learner by showing them images from the dataset $\mathcal{D}$. We refer to these teaching images as the ‘teaching set’, $\mathcal{D}_T$, a subset of images from $\mathcal{D}$ where $|\mathcal{D}_T| \ll |\mathcal{D}|$. In each round of interactive teaching, the teacher first selects an image represented by the feature vector $x_t$ to show to the human learner. The teacher displays images to the students as it is not possible to directly show them the high dimensional feature vector $x_t$. The selection of the image to show is based on a process we refer to as the ‘teaching strategy’, $S$, used by the teacher. First, the teacher only shows the image and does not yet display the ground truth class label. By not revealing the class label, the teacher is able to ask the student to state which class they believe the image belongs to. After receiving the student’s response, the teacher then updates its model of the student, and then reveals the ground truth label. Teaching proceeds for a set number of teaching rounds, and during each iteration, the teacher acquires a better understanding of the student’s current ability. Figure 1 outlines one teaching iteration.
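The teaching round described above can be sketched as a simple loop. This is only an illustrative outline, not the authors' implementation: the callables `select_image`, `ask_student`, and `update_model` are hypothetical placeholders for the teaching strategy, the web interface, and the student model, and the sketch assumes each image is taught at most once.

```python
import random

def teach(dataset, num_rounds, select_image, ask_student, update_model):
    """Run one interactive teaching session.

    dataset      -- list of (image_id, true_label) pairs
    select_image -- teaching strategy: (candidates, history) -> index into candidates
    ask_student  -- shows the image without its label, returns the student's guess
    update_model -- updates the teacher's model of the student from the history
    """
    candidates = list(dataset)
    history = []  # (image_id, true_label, student_guess) for each round
    for _ in range(min(num_rounds, len(candidates))):
        idx = select_image(candidates, history)
        image_id, true_label = candidates.pop(idx)  # assumption: no image is repeated
        guess = ask_student(image_id)               # the label is still concealed here
        history.append((image_id, true_label, guess))
        update_model(history)                       # teacher refines its student model
        # Only at this point would the interface reveal true_label to the student.
    return history

# Usage with a random strategy and a simulated student who guesses at random.
rng = random.Random(0)
data = [(i, i % 3) for i in range(12)]
log = teach(
    data,
    num_rounds=6,
    select_image=lambda cands, hist: rng.randrange(len(cands)),
    ask_student=lambda image_id: rng.randrange(3),
    update_model=lambda hist: None,
)
```

Swapping in a different `select_image` callable is all that is needed to compare teaching strategies under this sketch.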

With access to ground truth, the teacher trivially knows the conditional distribution $P(y_i \mid x_i)$ for each datapoint $x_i$. The student learner has a corresponding distribution $P_s(y \mid x)$, based only on training examples they have seen so far. During teaching, the teacher seeks to minimize the student’s expected loss

$$\mathbb{E}_{x}\left[ L\left(P(y \mid x),\, P_s(y \mid x)\right) \right], \tag{1}$$

over the dataset, where $L(\cdot)$ is an appropriate classification loss function. However, the teacher has no way of directly observing the student’s true class conditional distribution $P_s(y \mid x)$, so instead must approximate it as $\hat{P}_s(y \mid x)$. In this paper, we represent $\hat{P}_s(y \mid x)$ using a probabilistic, semi-supervised classifier.

3.1. Teaching Strategies 

The optimal teaching strategy is the one that minimizes the student’s expected loss from Equation 1. A simple strategy for choosing the next teaching image is to randomly sample from the dataset $\mathcal{D}$. Random sampling ($S_{rnd}$) does not model the student and is therefore unable to adapt to their ability. This lack of adaptation can manifest itself in two ways - 1) redundantly presenting teaching examples of concepts that have already been learned by the student, and 2) not directly reinforcing concepts that the student has shown themselves (through feedback) to be uncertain about.

Du and Ling proposed a strategy called ‘worst predicted’, here $S_{wp}$, which is related to uncertainty sampling commonly used in Active Learning. However, unlike in Active Learning, in Machine Teaching, the computer does have access to the ground truth labels. Their strategy selects the next teaching image as the one whose prediction deviates most from the ground truth,

$$x_t = \arg\min_{x} \hat{P}_s(\bar{y} \mid x), \tag{2}$$

where $\bar{y} = \arg\max_{y} P(y \mid x)$ is the ground truth class label known to the teacher. The disadvantage of this approach is that it is prone to proposing outliers as teaching images, as they tend to be highly uncertain under the current model. One potential solution to this problem is to weight the datapoints by some measure of local density in the feature space.
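As a small illustration of the ‘worst predicted’ rule, the sketch below picks the not-yet-shown image whose ground truth class currently has the lowest estimated probability under the student model. The function name and the list-of-lists layout for the probabilities are assumptions made for this example only.

```python
def worst_predicted(student_probs, true_labels, candidates):
    """Return the candidate whose true class has the lowest estimated probability.

    student_probs -- student_probs[i][c] is the teacher's estimate of the
                     student's probability that image i belongs to class c
    true_labels   -- true_labels[i] is the ground truth class of image i
    candidates    -- indices of images that have not been shown yet
    """
    return min(candidates, key=lambda i: student_probs[i][true_labels[i]])

# Four images, two classes. Image 2 belongs to class 0, but the student model
# assigns class 0 a probability of only 0.1, so it is selected next.
probs = [[0.7, 0.3], [0.4, 0.6], [0.1, 0.9], [0.5, 0.5]]
labels = [0, 0, 0, 1]
choice = worst_predicted(probs, labels, candidates=[0, 1, 2, 3])
```

Because the rule looks only at a single image's probability, an isolated outlier that the model predicts badly will be chosen ahead of more representative images.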

3.1.1 Expected Error Reduction Teaching 

Our teaching strategy, which we refer to as $S_{eer}$, takes inspiration from optimal sampling methods found in Active Learning [36, 45, 29]. Unlike $S_{wp}$, $S_{eer}$ chooses the teaching image which, if labeled correctly, would yield the greatest reduction in the future error over the images that are not in the teaching set, $\mathcal{D}_U = \mathcal{D} \setminus \mathcal{D}_T$, where

$$x_t = \arg\min_{x} \sum_{(x_i, y_i) \in \mathcal{D}_U} \left(1 - \hat{P}_s^{+(x, y)}(y_i \mid x_i)\right). \tag{3}$$

Here, $\hat{P}_s^{+(x, y)}$ is the updated estimate of the student’s conditional distribution if they were shown $x$ and in turn labeled it correctly. This strategy has the advantageous property that it first concentrates on regions of high density in the feature space, and as the student improves, refines the boundaries between these regions. In the context of Active Learning, this is referred to as the exploration versus exploitation trade off. This is related to the approach to learning advocated by curriculum learning, which focuses on easy concepts first and progressively increases the difficulty.
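A minimal sketch of the selection rule in Equation 3 follows, assuming NumPy and a graph-based (harmonic) label propagation as the student model. The function names, the brute-force loop over all candidates, and the toy data are assumptions for illustration; they are not taken from the authors' code.

```python
import numpy as np

def propagate(W, taught_idx, taught_labels, num_classes):
    """Harmonic label propagation; returns an N x C estimate of the student's beliefs."""
    n = W.shape[0]
    taught_idx = np.asarray(taught_idx, dtype=int)
    unl_idx = np.setdiff1d(np.arange(n), taught_idx)
    F = np.zeros((n, num_classes))
    F[taught_idx, np.asarray(taught_labels, dtype=int)] = 1.0
    L = np.diag(W.sum(axis=1)) - W  # graph Laplacian S - W
    F[unl_idx] = np.linalg.solve(L[np.ix_(unl_idx, unl_idx)],
                                 W[np.ix_(unl_idx, taught_idx)] @ F[taught_idx])
    return F

def select_eer(W, true_labels, taught_idx, student_answers, num_classes):
    """Pick the image that, if answered correctly, minimizes the expected future error."""
    true_labels = np.asarray(true_labels)
    candidates = np.setdiff1d(np.arange(W.shape[0]), taught_idx)
    best, best_err = None, np.inf
    for x in candidates:
        # Hypothetically add x to the teaching set with its correct label.
        F = propagate(W, list(taught_idx) + [x],
                      list(student_answers) + [true_labels[x]], num_classes)
        # x itself would contribute zero error, so sum over the remaining images.
        rest = candidates[candidates != x]
        err = np.sum(1.0 - F[rest, true_labels[rest]])
        if err < best_err:
            best, best_err = int(x), err
    return best

# Toy 1-D example: class 0 near 0, class 1 near 2. The student has answered
# image 0 (class 0) correctly, so the next image should come from class 1.
X = np.array([[0.0], [0.1], [2.0], [2.1], [2.2]])
W = np.exp(-1.0 * (X - X.T) ** 2)
choice = select_eer(W, [0, 0, 1, 1, 1], taught_idx=[0], student_answers=[0],
                    num_classes=2)
```

Each candidate requires one linear solve, so this brute-force version costs one propagation per unobserved image per round; a practical system would likely restrict the candidate pool or update the solution incrementally.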

3.2. Modeling the Student 

In this work we approximate the student’s conditional distribution given the teaching set, $\hat{P}_s(y \mid x)$, using graph based semi-supervised learning [41, 44]. Using the Gaussian Random Field (GRF) semi-supervised method, we can propagate the student’s estimate of the class labels for the current teaching set, $\mathcal{D}_T$, to the unobserved images $\mathcal{D}_U$ by defining a similarity matrix $W \in \mathbb{R}^{N \times N}$. The benefit of using a graph based approach is that we do not need to work directly in feature space, and can instead use the similarity, $W_{ij}$, between image pairs. This gives us the flexibility of allowing similarity to be defined using feature vectors extracted from the images, human provided attributes, or using distance metric learning.

If we are given a feature representation for our teaching set, one common approach for computing the similarity $W_{ij}$ between two images uses an RBF kernel

$$W_{ij} = \exp\left(-\gamma \, \|x_i - x_j\|_2^2\right), \tag{4}$$

where $\gamma$ is a length scale parameter that controls how much neighboring images influence each other. Using matrix notation, we define an $N \times C$ matrix $F$, where each element $f_{ic} = \hat{P}_s(y_i = c \mid x_i)$. We can propagate information from the labels provided by the student for the teaching set, encoded as a $|\mathcal{D}_T| \times C$ matrix $F_T$, to the unlabeled images $F_U$,

$$F_U = (S_{UU} - W_{UU})^{-1} W_{UT} F_T, \tag{5}$$

where $S$ is a diagonal matrix with entries $s_{ii} = \sum_j W_{ij}$. All entries in $F_T$ are 0, except where the human learner has estimated (correctly or incorrectly) the class label $c$ for teaching image $x_i$, which we set to $f_{ic} = 1$. $W_{UU}$ is the similarity matrix for the unobserved images, a subset of the full matrix $W$. We can efficiently evaluate Equation 5 using standard matrix operations; for datasets featuring 2000 images this takes under one second using unoptimized Python code.
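Equations 4 and 5 map directly onto a few matrix operations. The sketch below, assuming NumPy, builds the RBF similarity matrix and solves the linear system for the unobserved images; the function names and the toy two-cluster data are illustrative assumptions, and the linear system is solved rather than explicitly inverted.

```python
import numpy as np

def rbf_similarity(X, gamma):
    """Equation 4: W_ij = exp(-gamma * ||x_i - x_j||^2)."""
    sq_norms = np.sum(X ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    return np.exp(-gamma * np.maximum(sq_dists, 0.0))  # clamp rounding noise

def propagate(W, taught_idx, taught_labels, num_classes):
    """Equation 5: F_U = (S_UU - W_UU)^{-1} W_UT F_T."""
    n = W.shape[0]
    taught_idx = np.asarray(taught_idx, dtype=int)
    unl_idx = np.setdiff1d(np.arange(n), taught_idx)
    # One-hot rows holding the student's answers, whether right or wrong.
    F_T = np.zeros((len(taught_idx), num_classes))
    F_T[np.arange(len(taught_idx)), taught_labels] = 1.0
    S = np.diag(W.sum(axis=1))
    A = (S - W)[np.ix_(unl_idx, unl_idx)]
    W_UT = W[np.ix_(unl_idx, taught_idx)]
    F_U = np.linalg.solve(A, W_UT @ F_T)
    return unl_idx, F_U

# Two well separated clusters, one answered image in each.
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
              [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])
W = rbf_similarity(X, gamma=1.0)
unl_idx, F_U = propagate(W, taught_idx=[0, 3], taught_labels=[0, 1], num_classes=2)
```

Each row of `F_U` is a weighted average of the student's answers, so the rows sum to one and can be read as the estimated class probabilities for the corresponding unobserved image.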

4. Experiments 

To validate our proposed multi-class teaching strategy, we performed studies on real human subjects. Participants were recruited through Mechanical Turk, and interacted with our system remotely using our custom made web interface, built using the Python-based Django web framework.

4.1. Data 

For our experiments, we selected four different datasets, summarized in Table 1. To ensure that the teaching tasks were challenging to participants and one-shot learning was not possible, we chose datasets with small inter-class variation and large intra-class variation. Example images from each of the classes are presented in Figure 2. Unlike standard classification datasets featuring everyday objects [16], our datasets contain image categories that are challenging for non-domain experts to discriminate between, as they are made up of uncommon classes.

Two of the datasets, ‘Butterflies’ and ‘Seabed’, were collated by the authors of this paper from ongoing scientific studies into visual species identification. ‘Butterflies’ is a subset of a larger collection of British butterfly images from a museum collection captured over a period of 100 years. ‘Seabed’ is a set of images of underwater species taken from a study attempting to measure the effects of trawling on underwater bio-diversity. Both datasets were curated and annotated by domain experts.


Dataset       # Classes   # Images per Class   Origin
Chinese       3           237-240              cited prior work
Butterflies   5           300                  -
Seabed        4           100                  -
Leaves        4           102-150              cited prior work

Table 1: Summary of the datasets used, showing the number of classes and the minimum and maximum number of images per class.
Figure 2: Example images from the four datasets used in our experiments. Each column shows three random images per class. Note that these images are challenging to categorize as they exhibit a large amount of intra-class variation. Additionally, the ‘Seabed’ images are particularly difficult as they were captured ‘in the wild’ and contain occlusion and clutter.


Image features were extracted using a publicly available ConvNet system. For each dataset, we computed features using a network pre-trained on the ImageNet 2012 challenge dataset [37]. We then fine-tuned the fully connected layers using the known ground truth class labels for each of our datasets, which produced a separate ConvNet for each dataset. To construct the similarity matrix $W$, we reduced the dimensionality of the ConvNet features from 4096 to 50 using PCA, and set the length scale parameter $\gamma$ to 0.025 for all datasets. In our initial experiments, we explored custom-designed HoG-based features, which we found to perform worse than our fine-tuned ConvNet. Here, the additional supervised information provided during fine-tuning produces a representation where images from the same class are more smoothly distributed in feature space. A feature space that is better aligned with the student’s view of similarity should benefit all probabilistic strategies equally. It would also be possible to compute similarity between teaching images by crowdsourcing image rankings from a set of users, e.g. [32]. However, we found our ConvNet features to be a good balance between reducing the amount of additional supervised information required for each teaching task and real students’ performance. Code and data are available on our project website.

4.2. Experimental Design 

To evaluate our teaching algorithm, we conducted experiments on participants recruited through Mechanical Turk. Previously, Crump et al. [9] have shown that it is possible to replicate results from classic category learning experiments using Mechanical Turk. Using a similar experimental setup to prior work, our participants were first presented with a sequence of teaching images, which were then followed by a sequence of testing images.

For each experiment, participants were first told how many classes they were being asked to learn. Then teaching commenced using the interactive teaching loop illustrated in Figure 1. For each teaching image, participants were first shown the image, asked to estimate its class label by clicking on the corresponding button in our web interface, and then provided with the correct answer. After receiving the estimated class label from the participant, the teaching strategy updates its model of the student and chooses the next image to be shown. In contrast to the teaching phase, no corrective feedback in the form of the true class labels was provided in the testing phase. The testing round was only used for evaluation purposes and is not necessary in real teaching scenarios. Test images were randomly chosen for each participant, with an equal number from each class, and were excluded from the possible teaching set.

Each participant was presented with a random dataset from Table 1, combined with a random teaching strategy. For each dataset, the number of teaching images shown was set to three times the number of classes, and ten times the number for testing. In this way, the lengths of the teaching and testing rounds were proportional to the complexity of the task. We experimented with longer teaching rounds (> 40 images) and testing at regular intervals between teaching images to achieve a learning curve. However, we found through feedback that students became bored and frustrated with the enforced delay, encouraging them to drop out.

It is worth noting that our teaching tasks are significantly more difficult than most crowd-sourced image annotation tasks. In typical annotation tasks, workers already possess strong prior knowledge of the concepts involved, whereas in our teaching tasks, the participants were unlikely to have prior domain expertise. We surveyed participants at the start of the task to ensure that they possessed no prior task knowledge and we rejected results for those who claimed to have even moderate familiarity with any of the classes. As such, the student’s answer to the first teaching image was always a random guess. To avoid workers who were seemingly clicking at random, we also rejected results for those whose average response time per image was too fast (< 3 seconds) during testing. To encourage a conscientious effort in learning, we paid workers a bonus if they scored higher than a threshold during testing. After discarding noisy participants, we collected results from between 25 and 35 participants per strategy/dataset combination.

4.3. Baseline Strategies 

In addition to the baseline teaching strategies outlined in Section 3.1, we also compared to two other baselines, $S_{cc}$ and $S_{batch}$. For $S_{cc}$, or class centroids, we computed the feature space centroids for each class for a given dataset, and students were only presented with the images represented by these centroids during teaching. Teaching images were selected by randomly choosing from one of these centroids. If there was little intra-class variation, if one-shot learning was possible, or if the classes were familiar to the student, we would expect this baseline to perform very well. The final baseline, $S_{batch}$, is similar to offline batch teaching algorithms. Here, the ordering of the teaching images was computed offline. We computed the ordering using the $S_{eer}$ algorithm, but assuming that if shown an image, the student would always label it correctly. Given this assumption, the selection of teaching images is deterministic and is identical for all students regardless of their responses. Recent strategies for offline binary teaching, such as [39], are not directly applicable for comparison because we operate in the challenging interactive multi-class classification scenario.
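The class centroid baseline can be sketched as follows, assuming NumPy. The text does not say how a centroid is mapped back to an image, so this example assumes the image nearest to each class mean (in Euclidean distance) is used; the function name and toy data are likewise hypothetical.

```python
import numpy as np

def centroid_images(X, labels):
    """For each class, return the index of the image closest to the class mean."""
    labels = np.asarray(labels)
    chosen = {}
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        centroid = X[idx].mean(axis=0)
        # Assumption: the nearest real image stands in for the centroid.
        dists = np.linalg.norm(X[idx] - centroid, axis=1)
        chosen[int(c)] = int(idx[np.argmin(dists)])
    return chosen

X = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0],
              [10.0, 10.0], [10.0, 12.0], [10.0, 11.2]])
labels = [0, 0, 0, 1, 1, 1]
chosen = centroid_images(X, labels)

# During teaching, this baseline repeatedly draws at random from the chosen images.
rng = np.random.default_rng(0)
next_image = int(rng.choice(list(chosen.values())))
```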

4.4. Human Experiments 

Results from human participants are summarized in Table 2. Results for individual datasets are depicted in Figure 3, where the average number of testing images answered correctly is shown for each dataset and strategy combination. We can see that our $S_{eer}$ method outperforms the other teaching strategies on the ‘Chinese’, ‘Butterflies’, and ‘Seabed’ datasets. In these three, our method is consistently the best performing, while the other methods vary in performance depending on the specific dataset. As we can see from Table 2, there is no clear ‘second-best’ method, and the offline $S_{batch}$ and uncertainty $S_{wp}$ strategies are often outperformed by random $S_{rnd}$. The advantage of $S_{eer}$ is most pronounced on the ‘Seabed’ dataset, which also contains the most haphazard images, due to the acquisition of data from cameras ‘in the wild’, as opposed to neatly-framed imaging in controlled laboratory conditions.

Average timings during testing for the different strategies, calculated as the time between being shown a test image and submitting an answer, are presented in Table 2. Participants taught using our method tend to answer more quickly compared to the other strategies. $S_{rnd}$ and $S_{cc}$ also have low response times, but students’ poorer performance at test time possibly indicates a level of false-confidence.

Strategy                    Ave. Time (ms)   Ave. Score
Random        $S_{rnd}$     4876             0.67
Centroids     $S_{cc}$      4706             0.58
Worst Pred.   $S_{wp}$      5237             0.66
Batch         $S_{batch}$   6216             0.64
EER (Ours)    $S_{eer}$     4659             0.73

Table 2: Average participant response times during testing, and test set scores across all datasets.

Table 3 provides p-values for the statistical significance of our results. Two-tailed tests were conducted with a null hypothesis that the distributions of scores for our method across all datasets, and the competing method, are statistically similar, based on a Gaussian assumption. The p-values obtained are all well below the standard 0.05 threshold for statistical significance, indicating that the differences are unlikely to be due to chance.

Figure 4 shows the average learning curves for the five teaching strategies obtained during teaching. The average score for each 10% progress interval (through the training set) is calculated by averaging the number of correct responses over all students and datasets at that point along the teaching phase. Note that this is not equivalent to the true learning curve, as images are chosen to actively teach the student, rather than to assess a snapshot of their performance. We see a general trend of improving recognition rates with further teaching images. However, $S_{cc}$ gives a false sense of performance because the same centroid images are repeatedly shown, thus the student overfits to these images and typically fails to generalize during testing. Unlike the others, the uncertainty based $S_{wp}$ strategy has a relatively flat learning curve, because the outlier images shown are challenging to learn. This underfitting gives students only a weak understanding of each class’s variability.
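The averaging over 10% progress intervals can be sketched in plain Python. The function name and the representation of each session as a list of 0/1 outcomes are assumptions made for this example; binning by relative position lets sessions of different lengths (datasets with different numbers of classes) share the same progress axis.

```python
def learning_curve(sessions, num_bins=10):
    """Average correctness per progress interval across all teaching sessions.

    sessions -- one list per student of 0/1 outcomes (incorrect/correct),
                in the order the teaching images were shown
    """
    totals = [0.0] * num_bins
    counts = [0] * num_bins
    for answers in sessions:
        n = len(answers)
        for t, correct in enumerate(answers):
            # Map position t of n onto one of num_bins equal progress intervals.
            b = min(t * num_bins // n, num_bins - 1)
            totals[b] += correct
            counts[b] += 1
    return [totals[b] / counts[b] if counts[b] else None for b in range(num_bins)]

# Two students, four teaching images each, summarized in two intervals.
curve = learning_curve([[0, 0, 1, 1], [0, 1, 1, 1]], num_bins=2)
```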

Figure 5 shows examples of the teaching images shown to students for each of the five strategies with the ‘Chinese’ dataset. We see the capacity of $S_{eer}$ to adapt to incorrect responses, where attention is given to the ‘Stem’ class due to an incorrect previous answer, before returning to teach ‘Grass’ due to its previous incorrect answer, and finally exploring the student’s understanding of ‘Mound’. On the other hand, $S_{batch}$ is unable to adapt its teaching set and focuses on teaching ‘Mound’ and ‘Stem’ despite the student’s poor performance with ‘Grass’. $S_{wp}$ begins by displaying reasonable examples, but ends by attempting to teach very unusual examples which are not representative of the dataset’s distribution.

Strategy                    P-value
Random        $S_{rnd}$     0.0138
Centroids     $S_{cc}$      < 0.0001
Worst Pred.   $S_{wp}$      0.0027
Batch         $S_{batch}$   < 0.0001

Table 3: Two-tailed p-values for hypothesis tests on the statistical significance of our method compared to all others.

Figure 3: Human experiment results across the four datasets described in Table 1, showing the average scores after the testing phase across all participants. Human participants on Mechanical Turk using our Expected Error Reduction based teaching strategy (here EER) tend to have better recognition performance on average after teaching, compared to the other baselines.

The performance of $S_{eer}$ on the ‘Leaves’ dataset shows an example where we do not perform better than the random baseline, but come joint second. A property unique to this dataset is the multi-modal nature of the leaves present, where each class in fact represents an entire genus, composed of a number of different species that do not all look the same. We found that human learners typically assumed unimodal distributions during teaching and would often focus on only a single species within the entire genus.

Figure 4: Average learning curves across all students and datasets during the teaching phase.

4.5. Limitations 

Currently, our model does not attempt to directly recover from incorrect responses made by students in the past. If a student has previously given an incorrect answer, future teaching images will be selected from similar regions in the feature space. However, earlier incorrectly labeled images will still influence the label propagation. Allowing incorrectly labeled images to be relabeled could result in the teaching strategy continually presenting the same images until they are correctly labeled. This behavior would be appropriate for a machine learner, but a human would quickly learn to cheat the learning task. Any revision style strategy would have to be carefully designed to ensure that concepts that are already learned are not continually revisited due to a de-emphasizing of earlier teaching answers.

5. Conclusion 

Machine Teaching has the potential to enable humans to learn concepts without human-to-human expert tutoring. By automatically adapting the curriculum to a student’s ability and performance, teaching can be performed in situations where it is difficult or prohibitively costly to get direct access to domain-level expertise from a human teacher. In this work, we have taken a step in this direction by proposing an interactive multi-class teaching strategy. Its objective is to present to the student the teaching images that will be most informative, given an online estimate of their current knowledge. Unlike other proposed strategies, we are less likely to teach outliers, and as a result, do not waste time showing unrepresentative images. Similar to curriculum learning, our strategy initially focuses on representative images and then introduces more difficult ones over time, as the student’s performance improves.

5.1. Future Work 

Currently, we present teaching images to the students one at a time. In future, we plan to investigate different methods for displaying images. Visualizations such as pairwise comparisons [22], and highlighting local regions or parts [6], may prove more effective at conveying discriminative details and characteristics of different categories. Some images are intrinsically more ‘memorable’ than others, and incorporating such measures into teaching image selection may also improve test time performance.

In curriculum learning, task difficulty is increased as performance improves. In future work, we shall also investigate other teaching paradigms such as the spiral approach to teaching. In spiral learning, new categories are introduced over time while continually re-emphasizing the earlier concepts to ensure that they become committed to memory.

Figure 5: Example images and responses for the different teaching strategies from 5 sample individual students during teaching of the ‘Chinese’ dataset. Solid boxes indicate correct answers, dashed lines for incorrect answers, and box colors indicate the ground truth class labels.

Given that we can now teach humans visual categorization tasks in an automated fashion, in future work we intend to investigate what additional information we can extract from our students during and after teaching. In contrast to machines, studies suggest that humans can learn with idealized versions of data that can have a different distribution from the test set. Exploring teaching as a domain adaptation problem could allow us to acquire annotations for data which is very different from our teaching set. Finally, we have assumed that our feature space is correlated with a student’s concept of similarity. It may be more effective to jointly estimate both the student’s current ability and their notion of similarity during teaching.